ByteLevelBPETokenizer output seems weird
#ByteLevelBPETokenizer
https://github.com/huggingface/tokenizers/issues/203
The merges.txt and vocab.json files I obtained are not human-readable.
Answer: https://github.com/huggingface/tokenizers/issues/203#issuecomment-605105611
The byte-level BPE converts all Unicode code points into multiple byte-level characters:
1. Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
2. Each byte value gets a "visible" character assigned to it from the beginning of the Unicode table.
So some characters get other representations; for example, the (half-width) space U+0020 becomes Ġ.
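A minimal sketch of this two-step mapping in Python, reconstructing a GPT-2-style byte-to-visible-character table. This is an illustration of the scheme described above, not the tokenizers library's actual source code:

```python
def bytes_to_unicode():
    # Printable bytes keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = list(printable)
    code_points = list(printable)
    n = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (control characters, space, ...) are shifted
            # past the first 256 code points so they get a visible stand-in.
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(mapping[0x20])                                         # Ġ (U+0120)
print("".join(mapping[b] for b in "hello world".encode()))   # helloĠworld
print("".join(mapping[b] for b in "日本語".encode("utf-8")))   # one visible char per UTF-8 byte
```

Running it shows that the space byte 0x20 maps to Ġ (U+0120), which is why spaces appear as Ġ in merges.txt and vocab.json.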